Abstract: In this paper, we address speaker-independent
recognition of the spoken Chinese digits 0-9 based on HMM.
Our earlier inside and outside tests achieved 92.5% and
76.79%, respectively. To improve the performance further,
two important speech features, the MFCC number and the
cluster number of vector quantization, are combined and
evaluated over a range of values. The best performance
reaches 96.2% and 83.1% with MFCC number = 20 and VQ
cluster number = 64.

I. Introduction

In speech processing, automatic speech recognition (ASR)
automatically converts human speech input into text output
over vocabularies of various sizes. ASR can be applied in a
wide range of applications, such as human interface design,
speech information retrieval (SIR) [11,12], language
translation, and so on. In the real world, there are several
commercial ASR systems, for example, IBM's ViaVoice, the
Mandarin dictation system Golden Mandarin (III) of NTU in
Taiwan, voice portals on the Internet, and the 104 on-line
speech query system. Modern ASR technologies merge signal
processing, pattern recognition, networking, and
telecommunication into a unified framework. Such an
architecture can be extended to broad service domains, such
as e-commerce and wireless speech systems over WiMAX.

The approaches adopted in ASR can be categorized as:
1) hidden Markov models (HMM) [1,2,3,4], 2) neural
networks [5,6,7], and 3) wavelet-based and spectrum
coefficients of speech [15,16]. A further method combines
the first two approaches [8,9]. The hidden Markov model
results from the attempt to model speech generation
statistically, and thus belongs to the first category above.
Over the past several years it has become the most
successful speech model used in ASR. The main reason
for this success is its powerful ability to characterize
the speech signal in a mathematically tractable way.

In a typical HMM-based ASR system, the HMM stage is
preceded by parameter extraction. Thus the input to the
HMM is a discrete-time sequence of parameter vectors.

The remainder of the paper is organized as follows: speech
processing is introduced in Section II, and the acoustic
model for recognition is described in Section III. The
initial results of our earlier approach are presented in
Section IV. The improvement methods are described in
Section V.

II. Processes of Speech 

In this section, we describe the pre-processing procedures.

A. Processing Speech 

The analog voice signals are recorded through a
microphone and must then be digitized and quantized. The
sampling process can be described as follows:

$$x_p(t) = x_a(t)\,p(t) \tag{1}$$

where $x_p(t)$ and $x_a(t)$ denote the processed and analog
signals, respectively, and $p(t)$ is the impulse signal.

Each signal is segmented into several short frames of
speech, each containing a time series of samples. The
features of each frame are extracted for further processing.

B. Pre-emphasis 

Basically, the purpose of pre-emphasis is to increase
the magnitude of some (usually higher) frequencies with
respect to the magnitude of other (usually lower)
frequencies in order to improve the overall signal-to-noise
ratio (SNR) by minimizing the adverse effects of such
phenomena as attenuation distortion.
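
The paper does not state which pre-emphasis filter is used.
A minimal sketch, assuming the common first-order filter
y[n] = x[n] - c x[n-1], is given below; the function name
and the coefficient c = 0.97 are illustrative assumptions,
not values taken from our experiments.

```python
import numpy as np

def pre_emphasize(signal, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1].

    coeff = 0.97 is an assumed, commonly used value.
    The first sample is passed through unchanged.
    """
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])
```

A constant (low-frequency) input is strongly attenuated
after the first sample, while rapid sample-to-sample
changes are preserved, which is the intended boost of the
higher frequencies relative to the lower ones.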

C. Frame Blocking 

When analyzing audio signals, we usually adopt short-term
analysis because most audio signals are relatively stable
within a short period of time. Usually, the signal is
segmented into time frames of, say, 15-30 ms.
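
Fixed-size frame blocking can be sketched as follows. The
frame length of 20 ms lies within the 15-30 ms range above;
the 10 ms hop (frame overlap) and the function name are
assumptions made for illustration only.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames (one frame per row).

    frame_ms and hop_ms are assumed values. Trailing samples that
    do not fill a whole frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    starts = hop_len * np.arange(n_frames)
    return np.stack([signal[s:s + frame_len] for s in starts])
```

At a sampling rate of 44100 Hz, a 20 ms frame contains 882
samples and a 10 ms hop corresponds to 441 samples.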

D. Hamming Window 

In signal processing, a window function is a function that
is zero-valued outside of some chosen interval. The Hamming
window is a weighted moving average transformation used to
smooth the periodogram values.

Suppose that the original signal $s(n)$ is as follows:

$$s(n), \quad n = 0, \ldots, N-1 \tag{2}$$

Multiplying the original signal $s(n)$ by the Hamming
window $w(n)$ gives $s(n)\,w(n)$, where $w(n)$ is defined
as follows:

$$w(n) = (1 - a) - a\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{3}$$

where $N$ denotes the number of samples in a window.
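
Equation (3) can be evaluated directly, as in the sketch
below. The value a = 0.46, which gives the standard Hamming
window 0.54 - 0.46 cos(.), is an assumption, since the
value of a is not stated in the text.

```python
import numpy as np

def hamming(N, a=0.46):
    """Evaluate w(n) = (1 - a) - a*cos(2*pi*n/(N-1)) for n = 0..N-1.

    a = 0.46 is assumed (standard Hamming window); requires N >= 2.
    """
    n = np.arange(N)
    return (1.0 - a) - a * np.cos(2.0 * np.pi * n / (N - 1))

def apply_window(frame):
    """Multiply one frame s(n) sample-by-sample with w(n)."""
    frame = np.asarray(frame, dtype=float)
    return frame * hamming(len(frame))
```

With a = 0.46 the result agrees with NumPy's built-in
`numpy.hamming`.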

E. Mel-frequency cepstral coefficients 

Mel-frequency cepstral coefficients (MFCCs) are among the
most effective feature parameters in speech recognition.
For speech representation, it is well known that MFCC
parameters appear to be more effective than power-spectrum
based features. MFCCs are based on the non-linear frequency
characteristic of the human ear and yield a high
recognition rate in practical applications:

o At lower frequencies, human hearing is more acute.

o At higher frequencies, human hearing is less acute.

As shown in Fig. 7, the mel frequency scale is given by:

$$\mathrm{mel}(f) = 1125\,\ln\!\left(1 + \frac{f}{700}\right) \tag{4}$$
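
As a small worked illustration of Eq. (4), the conversion
and its inverse can be written as below. The function names
are illustrative; the inverse follows algebraically from
Eq. (4).

```python
import math

def hz_to_mel(f):
    """Eq. (4): mel(f) = 1125 * ln(1 + f / 700), with f in Hz."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of Eq. (4): f = 700 * (exp(m / 1125) - 1)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)
```

The step from 0 to 1000 Hz spans more mels than the step
from 1000 to 2000 Hz, which reflects the finer resolution
of hearing at lower frequencies.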

III. Acoustic Model of Recognition 

A. Vector Quantization

The foundational vector quantization (VQ) algorithm was
proposed by Y. Linde, A. Buzo, and R. Gray in 1980 and is
known as the LBG algorithm. LBG is based on k-means
clustering [2,5]: given a codebook of size G, the training
vectors are categorized into G groups. The centroid C_i of
each group G_i serves as the codeword representing the
vectors in that group. In principle, the grouping is a
tree-based structure.
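
A minimal sketch of LBG-style codebook training is given
below, assuming binary splitting of the codewords followed
by k-means refinement with Euclidean distance. The
perturbation eps, the iteration count, and the function
names are illustrative assumptions rather than the settings
of our system.

```python
import numpy as np

def quantize(vectors, codebook):
    """Return the index of the nearest codeword for every vector."""
    diffs = vectors[:, None, :] - codebook[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

def lbg_codebook(vectors, size, eps=0.01, iters=20):
    """Grow a codebook by repeated splitting (size: a power of two)."""
    vectors = np.asarray(vectors, dtype=float)
    codebook = vectors.mean(axis=0, keepdims=True)  # one global centroid
    while codebook.shape[0] < size:
        # Split each codeword into a slightly perturbed pair.
        codebook = np.vstack([codebook + eps, codebook - eps])
        for _ in range(iters):  # k-means refinement
            labels = quantize(vectors, codebook)
            for i in range(codebook.shape[0]):
                members = vectors[labels == i]
                if len(members) > 0:  # keep old codeword if cell is empty
                    codebook[i] = members.mean(axis=0)
    return codebook
```

After training, `quantize` maps each feature vector to a
codeword index, producing the discrete symbol sequence that
is passed to the HMM.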

B. Hidden Markov Model 

A hidden Markov model (HMM) is a statistical model in
which the system being modeled is assumed to be a Markov
process with unknown parameters. The challenge is to
determine the hidden parameters from the observable data.
An HMM can be considered the simplest dynamic Bayesian
network.



In a regular Markov model, the state is directly visible 
to the observer, and therefore the state transition 
probabilities are the only parameters. However, in a 
hidden Markov model, the state is not directly visible (so- 
called hidden), while the variables influenced by the state 
are visible. Each state has a probability distribution over 
the output. Therefore, the sequence of tokens generated by 
an HMM gives some information about the sequence of 
states. 

A complete HMM can be defined as follows:

$$\lambda = (\pi, A, B) \tag{5}$$

The three components $(\pi, A, B)$ are:

1. $\pi$ (initial state probability):

$$\pi = \{\pi_i = \mathrm{prob}(q_1 = S_i)\}, \quad 1 \le i \le N \tag{6}$$

2. $A$ (state transition probability):

$$A = \{a_{ij} = \mathrm{prob}(q_{t+1} = S_j \mid q_t = S_i)\}, \quad 1 \le i, j \le N \tag{7}$$

3. $B$ (observation symbol probability):

$$B = \{b_j(O_t) = \mathrm{prob}(O_t \mid q_t = S_j)\}, \quad 1 \le j \le N \tag{8}$$

where $O = \{O_1, O_2, \ldots, O_T\}$ is the observation
sequence, $S = \{S_1, S_2, S_3, \ldots, S_N\}$ is the set of
states, $q = \{q_1, q_2, q_3, \ldots, q_T\}$ is the state
sequence, $T$ denotes the length of the observation
sequence, and $N$ is the number of states.
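
To make the roles of the three parameters concrete, the
probability of an observation sequence under a discrete HMM
can be computed with the standard forward algorithm, as in
the sketch below. The array layout (A[i, j] for the
transition from state i to state j, B[j, k] for emitting
symbol k in state j) and the function name are assumptions
made for this illustration.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Return prob(O | pi, A, B) for a discrete observation sequence.

    pi:  (N,)   initial state probabilities
    A:   (N, N) A[i, j] = prob(next state j | current state i)
    B:   (N, K) B[j, k] = prob(symbol k | state j)
    obs: sequence of symbol indices, e.g. VQ codeword indices
    """
    alpha = pi * B[:, obs[0]]            # initialization at t = 1
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]  # induction step
    return alpha.sum()                   # termination
```

For long sequences the product of probabilities underflows,
so practical implementations rescale alpha at each step or
work with logarithms.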

C. System Models 

The recognition system is composed of two main
functions: 1) extracting the speech features, including
frame blocking, VQ, and so on, and 2) constructing the
models and performing recognition based on the HMM, VQ,
and the Viterbi algorithm.
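
The recognition step can be sketched as follows, assuming
one discrete HMM per digit and a decision for the digit
whose model gives the highest Viterbi score. The dictionary
`models`, the function names, and the log-domain
formulation are illustrative assumptions, not a description
of our exact implementation.

```python
import numpy as np

def viterbi_log_score(pi, A, B, obs):
    """Log-probability of the best state path for a symbol sequence."""
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = log_pi + log_B[:, obs[0]]
    for symbol in obs[1:]:
        # Best predecessor for each state, then emit the next symbol.
        delta = (delta[:, None] + log_A).max(axis=0) + log_B[:, symbol]
    return delta.max()

def recognize(models, obs):
    """Pick the label whose (pi, A, B) model scores obs highest.

    models: hypothetical dict mapping a label to a (pi, A, B) tuple.
    """
    return max(models, key=lambda label: viterbi_log_score(*models[label], obs))
```

Here `obs` is the sequence of VQ codeword indices obtained
from the feature vectors of one utterance.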

Short speech signals vary sharply and rapidly, whereas
longer signals vary slowly. Therefore, we use dynamic
frame blocking rather than a fixed frame size in different
experiments.

IV. Initial Experiments 

A. Recognition System Based on HMM 

In this paper, we focus on speaker-independent
recognition of the spoken Chinese digits 0-9. All samples
are recorded at 44100 Hz / 16 bits by three native male
adults. A total of 560 samples are divided into two parts,
280 for training and 280 for testing. The samples are then
pre-processed, including pre-emphasis, frame blocking, and
VQ.

B. Comparison of Fixed and Dynamic Frame Sizes

According to our empirical results comparing fixed and
dynamic frame sizes, the outside-test recognition rate with
the fixed frame size reaches 76.79%, which is superior to
the 75.71% obtained with the dynamic frame size, as shown
in Table 1.

Table 1: Comparison of frame sizes (SymbolNum = 64)

Frame    Test  Wave  MFCC   VQ     HMM       Symbol  Rate
size           Num   time   time   training  Num     (%)
fixed    I     280   32.9   5.77   3.44      64      90.36
         O     280                                   76.79*
dynamic  I     280   32.0   3.31   2.42      64      92.50*
         O     280                                   75.71

Note: I and O denote the inside and outside testing, respectively.

V. Further Improvement 

A. Improving the Samples of Speech 

According to our empirical results, the recognition rate
is best when the cluster number is 64. The inside and
outside testing rates are 92.5% and 76.79%, respectively.

To improve the performance, we analyze all the speech
waveforms. Many samples are affected by boost noise
arising from the speaker or the environment, as shown in
Fig. 1. In such a situation, the end points of the boosted
speech usually cannot be detected correctly, which degrades
the performance of the system.

End points are usually detected from the zero-crossing
rate (ZCR) and the energy of the speech, as shown in Fig. 1.
However, extra rules are clearly needed for detection in
noisy situations. Based on experimental results and
observation, the improvement rules are summarized as
follows:

Input: X(n), n = 1 to j
Output: Y(m), 1 <= m <= j

1. Segment the speech X(n) into frames:
framedY = framed(X(n)).

2. Calculate the ZCR and energy of each frame.

3. Smooth the curves of both ZCR and energy.

4. Calculate the average over the first 10 frames and
multiply it by 1.2. The resulting value is used as
the threshold for the detection process.

5. The ZCR is valid only if framedY is larger than 100, as
shown in Fig. 2.

6. The speech is effective only if its size is larger
than 3 ms.

7. The starting energy of the speech should be larger than
the threshold.

8. The energy of 5 consecutive frames of speech
should increase progressively.
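
A simplified sketch of rules 1-4, 7, and 8 is given below.
Rules 5 and 6 are omitted, and the moving-average smoothing
width, the use of the threshold on energy only, and all
function names are assumptions made for illustration.

```python
import numpy as np

def frame_features(x, frame_len):
    """Rules 1-2: per-frame energy and zero-crossing count."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    # A sign change between neighbouring samples counts as one crossing.
    zcr = (np.diff(np.sign(frames), axis=1) != 0).sum(axis=1)
    return energy, zcr

def smooth(curve, width=3):
    """Rule 3: moving-average smoothing (width is an assumed value)."""
    return np.convolve(curve, np.ones(width) / width, mode="same")

def find_start_frame(energy, lead=10, factor=1.2, rise=5):
    """Rules 4, 7, 8: first frame above threshold with rising energy."""
    threshold = factor * energy[:lead].mean()   # rule 4
    for i in range(lead, len(energy) - rise + 1):
        window = energy[i:i + rise]
        # Rule 7: start above threshold; rule 8: 5 frames strictly rising.
        if window[0] > threshold and np.all(np.diff(window) > 0):
            return i
    return None  # no starting point found
```

The function returns the index of the first frame after the
leading 10 frames that satisfies rules 7 and 8, or `None`
if no such frame exists.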

With this improvement, the end points of the spoken
digit 8 (ㄅㄚ) with boost noise can be detected, as shown in
Fig. 2. The improved detection leads to better results in
the subsequent recognition process.

B. Better Combination of Various Features

To further improve the performance, two spectral features
of speech, the MFCC degree and the VQ cluster number, are
combined and evaluated. The MFCC degree varies from 8 to 36
in steps of 4, and the cluster number varies from 32 to 256
in steps of 32. We evaluated all combinations of these two
features. The processing times needed for computation are
shown in Table 2. The best results are achieved with MFCC
number = 20 and VQ cluster number = 64. The inside and
outside recognition rates reach 96.2% and 83.1%, as shown
in Fig. 3, and the net improvements for inside and outside
testing are 3.7% and 6.3%, respectively. We list only the
results for VQ = 64 in this paper.
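
The exhaustive evaluation over both features amounts to a
grid search over 8 x 8 = 64 settings, which can be sketched
as follows. The callable `evaluate`, which would train and
test the recognizer for one setting and return its
recognition rate, is hypothetical and must be supplied by
the user.

```python
from itertools import product

MFCC_DEGREES = range(8, 37, 4)        # 8, 12, ..., 36
CLUSTER_NUMBERS = range(32, 257, 32)  # 32, 64, ..., 256

def grid_search(evaluate):
    """Return (best_setting, best_score) over all combinations.

    evaluate(mfcc_degree, cluster_number) is a hypothetical
    user-supplied function returning a recognition rate.
    """
    scores = {
        (m, k): evaluate(m, k)
        for m, k in product(MFCC_DEGREES, CLUSTER_NUMBERS)
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```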

Table 2: Processing time with VQ = 64.

MFCC degree   8     12    16    20    24    28    32    36
MFCC          15.8  16.9  18.6  23.5  25.3  27.2  28.5  29.9
VQ            1.0   2.6   3.3   3.4   3.8   4.9   5.3   6.6
HMM           1.7   1.7   1.8   1.8   1.8   1.8   1.9   1.9


Fig. 1: Before improvement, Chinese number 8 (ㄅㄚ).

Fig. 3: Performance with VQ = 64 and MFCC degrees varied
between 8 and 36 (x-axis: MFCC degree; legend: Inside
Test (%)).

VI. Conclusion 

In this paper, we address speaker-independent
recognition of spoken Chinese digits based on HMM. An
algorithm for our approach is proposed. A total of 480
speech samples are recorded and pre-processed. The
preliminary outside testing rate is 76.79%.

To further improve the performance, two speech
features, the MFCC number and the VQ cluster number, are
evaluated. We then find the combination of these two
spectral features that achieves the best results. The best
performance is achieved with MFCC number = 20 and VQ
cluster number = 64. The final inside and outside
recognition rates are 96.2% and 83.1%. This indicates that
the proposed approach can be employed to recognize
speaker-independent speech.
Future work will address the following:

1) Merging other effective methods with our approach to
enhance the performance.

2) Applying the method to isolated Chinese speech
recognition.

3) Improving the precision rates.